Finding Structure and Characteristics of Web Documents for Classification
Authors
Abstract
Many Web documents that contain the same type of information have a similar structure. In this paper, we examine the problem of finding the structure of Web documents and present a hierarchical structure to represent the relations among text data in those documents. Because Web publishing standards are loose, different authors can use different wordings (labels) for the same information. We introduce a label discovery algorithm that uses the hierarchical structure extracted from the Web pages. The algorithm discovers similar labels that describe the same kind of information, and such labels help us find the structure of the Web documents. Experiments show that the algorithm can successfully discover similar labels and that the structure obtained by our method can distinguish Web pages accurately.
Similar resources
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into predefined categories. A great deal of news spreads on the Web. A text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a suitable way t...
Optimizing the Pre-Processing Phase of Automatic e-Document Classification
Electronic documents such as e-catalogs, e-mails, and Web documents have their own distinct characteristics that can be utilized in search and classification. They are structured, noisy, and, in some cases, related to each other. We analyze the characteristics of three major types of e-documents (e-catalogs, e-mails, and Web documents) and propose methods for optimizing automatic classification o...
An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library of resources. However, classifying documents within a large amount of data is still a problem, and finding particular documents demands time and energy. Grouping similar documents into specific classes can reduce the time needed to search for the required data, particularly for text documents. This is further facilitated by using Artificial...
Publication date: 2000